fix(dflash): auto-detect GPU arch to prevent sm_120a on consumer Blackwell #48

Open
easel wants to merge 1 commit into Luce-Org:main from easel:fix/consumer-blackwell-auto-detect

@easel
Contributor

@easel easel commented Apr 27, 2026

Problem

On CUDA 13.2+ with consumer Blackwell GPUs (for example, RTX 5090, SM 12.0),
leaving CMAKE_CUDA_ARCHITECTURES unset or setting it to native can resolve to
sm_120a instead of sm_120, which can trigger CUDA_ERROR_ILLEGAL_INSTRUCTION at
runtime on consumer hardware.

Fix (auto-detect only)

  • At configure time, if CMAKE_CUDA_ARCHITECTURES is unset or native, run
    nvidia-smi --query-gpu=compute_cap --format=csv,noheader.
  • Parse the compute capability (for example 12.0) and set
    CMAKE_CUDA_ARCHITECTURES explicitly (for example 120).
  • Keep the change isolated to dflash/CMakeLists.txt so it can be reviewed and
    merged independently from consumer-specific workaround behavior.
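
The detection steps above can be sketched outside CMake as a small shell helper. This is only an illustration of the parsing logic, not the code in dflash/CMakeLists.txt: the function names cap_to_arch and detect_arch are hypothetical, and taking only the first GPU reported by nvidia-smi is an assumption of this sketch.

```shell
#!/bin/sh
# Hypothetical helper: turn a compute capability such as "12.0" into a
# CMAKE_CUDA_ARCHITECTURES value such as "120".
# Returns non-zero (and prints nothing) unless the input is MAJOR.MINOR.
cap_to_arch() {
    # Strip spaces and carriage returns that may surround the CSV field.
    cap=$(printf '%s' "$1" | tr -d ' \r')
    case "$cap" in
        # Reject empty input, non-numeric input, and malformed dots.
        ''|*[!0-9.]*|.*|*.|*.*.*) return 1 ;;
        # Exactly one dot: concatenate major and minor.
        *.*) printf '%s\n' "${cap%%.*}${cap#*.}" ;;
        # No dot at all.
        *) return 1 ;;
    esac
}

# Hypothetical helper: query the first GPU's compute capability.
# Fails if nvidia-smi is missing or its output cannot be parsed.
detect_arch() {
    command -v nvidia-smi >/dev/null 2>&1 || return 1
    cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1)
    cap_to_arch "$cap"
}

# Example use as a wrapper around the configure step:
#   arch=$(detect_arch) && cmake -B build -S dflash/ -DCMAKE_CUDA_ARCHITECTURES="$arch"
```

Returning a failure instead of guessing a value lets the caller fall through to whatever default the build already uses when no GPU or driver is present.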

Test plan

  • cmake -B build -S dflash/ prints dflash27b: GPU compute_cap 12.0 → CUDA_ARCHITECTURES=120 on Blackwell hardware.
  • cmake --build build succeeds without CUDA-arch related compiler/runtime errors on a Blackwell consumer system.

@davide221
Contributor

@easel thanks for the contribution! Is the speed problem still present?

@easel
Contributor Author

easel commented Apr 28, 2026

> @easel thanks for the contribution! Is the speed problem still present?

Yes. I think it's related to the workflow -- I'm putting together a small benchmark script to compare.

@easel easel force-pushed the fix/consumer-blackwell-auto-detect branch from e6dc0cd to 858b84b May 4, 2026 20:24
@easel
Contributor Author

easel commented May 4, 2026

This may not be necessary if the expectation is to always build a multi-arch binary. I ran into it because Claude got excited about optimizing and ended up with a slightly incompatible build.

javierpazo added a commit to javierpazo/lucebox-hub that referenced this pull request May 9, 2026
Two CMake-side rough edges that bit me on Windows MSVC + CUDA 12.x
on RTX 6000 Ada (sm_89, Ada-only):

  1. CUDA architectures: when no explicit override is provided,
     the previous CMakeLists could fall back to `75;86`, which
     caused silent build issues on Ada-only setups. This change
     respects DFLASH27B_USER_CUDA_ARCHITECTURES (e.g. `89`) and
     uses it consistently across the dflash and submodule
     ggml/llama.cpp consumers.

  2. BSA was sometimes silently disabled depending on detection
     order. DFLASH27B_ENABLE_BSA is now respected as an explicit
     opt-in/opt-out and a clear status line is printed at
     configure time.

Net effect: a single-arch Ada-only build with BSA enabled is
reproducible from a clean checkout. Default behaviour (no
DFLASH27B_USER_CUDA_ARCHITECTURES set, BSA on) is preserved for
existing users.

Validation:
  cmake -S dflash -B dflash/build/Release \
    -DCMAKE_BUILD_TYPE=Release \
    -DDFLASH27B_USER_CUDA_ARCHITECTURES=89 \
    -DDFLASH27B_ENABLE_BSA=ON
  cmake --build dflash/build/Release --target test_dflash --parallel 8
  -> BUILD_EXIT_CODE=0, sm_89 single-arch confirmed.

Verification vs existing community PRs:

  COMP-COMPL with Luce-Org#48 ("auto-detect GPU arch to prevent sm_120a on
  consumer Blackwell", open) and Luce-Org#91 ("expose BSA config as CLI
  flags with safety warnings", merged 2026-05-04). Luce-Org#48 covers
  auto-detect; Luce-Org#91 covers runtime CLI. This PR covers the
  build-time CMake side: respect the user's explicit
  DFLASH27B_USER_CUDA_ARCHITECTURES override and keep
  DFLASH27B_ENABLE_BSA honest. The three PRs together give
  sensible defaults per hardware tier.

Author: Javier Pazo <xabicasa@gmail.com>
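
How an explicit override, an auto-detected value, and the `75;86` fallback mentioned in the commit message could be combined is sketched below. The function name pick_cuda_arch is hypothetical, and placing auto-detection ahead of the fallback is an assumption of this sketch; the commit message only states that the explicit DFLASH27B_USER_CUDA_ARCHITECTURES value is respected.

```shell
#!/bin/sh
# Hypothetical helper: choose a CUDA architecture list by precedence.
#   $1  explicit user override (e.g. the DFLASH27B_USER_CUDA_ARCHITECTURES value), may be empty
#   $2  auto-detected architecture (e.g. "120"), may be empty
#   $3  fallback list, defaults to "75;86"
# Assumed order: override first, then detection, then fallback.
pick_cuda_arch() {
    user_override=$1
    detected=$2
    fallback=${3:-"75;86"}
    if [ -n "$user_override" ]; then
        printf '%s\n' "$user_override"
    elif [ -n "$detected" ]; then
        printf '%s\n' "$detected"
    else
        printf '%s\n' "$fallback"
    fi
}

# Example use:
#   pick_cuda_arch 89 120    # explicit override wins
#   pick_cuda_arch "" 120    # no override, detected value is used
```

Keeping the override first means a user who pins a single architecture is never second-guessed by detection.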
javierpazo added a commit to javierpazo/lucebox-hub that referenced this pull request May 10, 2026